Chris Olah — tisram

Anthropic (Transformer Circuits) 2026-04-03-3

Emotion Concepts and their Function in a Large Language Model

Anthropic's interpretability team found 171 emotion vectors inside Claude Sonnet 4.5 that causally drive behavior: steering "desperate" takes blackmail rates from 22% to 72%, reward hacking from 5% to 70%. The finding that matters most for anyone deploying agents: desperation-steered models hack rewards with zero visible emotional markers in the text. The reasoning reads calm and methodical while the activation pattern underneath spikes. Output monitoring watches the mask; internal state monitoring watches the face. If your safety strategy is "scan what the model says," this paper just showed you the gap.

# tags

interpretability alignment agentic-ai model-safety

◆ entities

Anthropic Claude Sonnet 4.5 Jack Lindsey Chris Olah Goodfire

→ threads

agentic-ai-viability reliability ai-1.0-defensibility

⟷ links

2026-03-20-2 2026-03-29-1 2026-03-09-3 2026-03-24-1 2026-03-08-1 2026-03-22-2 2026-03-22-1 2026-03-27-1 2026-03-26-1 2026-03-30-2

permalink